{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Theil-Sen Regression\n\n\nComputes a Theil-Sen Regression on a synthetic dataset.\n\nSee `theil_sen_regression` for more information on the regressor.\n\nCompared to the OLS (ordinary least squares) estimator, the Theil-Sen\nestimator is robust against outliers. It has a breakdown point of about 29.3%\nin case of a simple linear regression which means that it can tolerate\narbitrary corrupted data (outliers) of up to 29.3% in the two-dimensional\ncase.\n\nThe estimation of the model is done by calculating the slopes and intercepts\nof a subpopulation of all possible combinations of p subsample points. If an\nintercept is fitted, p must be greater than or equal to n_features + 1. The\nfinal slope and intercept is then defined as the spatial median of these\nslopes and intercepts.\n\nIn certain cases Theil-Sen performs better than `RANSAC\n<ransac_regression>` which is also a robust method. This is illustrated in the\nsecond example below where outliers with respect to the x-axis perturb RANSAC.\nTuning the ``residual_threshold`` parameter of RANSAC remedies this but in\ngeneral a priori knowledge about the data and the nature of the outliers is\nneeded.\nDue to the computational complexity of Theil-Sen it is recommended to use it\nonly for small problems in terms of number of samples and features. For larger\nproblems the ``max_subpopulation`` parameter restricts the magnitude of all\npossible combinations of p subsample points to a randomly chosen subset and\ntherefore also limits the runtime. Therefore, Theil-Sen is applicable to larger\nproblems with the drawback of losing some of its mathematical properties since\nit then works on a random subset.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Author: Florian Wilhelm -- <florian.wilhelm@gmail.com>\n# License: BSD 3 clause\n\nimport time\nimport numpy as np\nimport matplotlib.pyplot as plt\nfrom sklearn.linear_model import LinearRegression, TheilSenRegressor\nfrom sklearn.linear_model import RANSACRegressor\n\nprint(__doc__)\n\nestimators = [('OLS', LinearRegression()),\n              ('Theil-Sen', TheilSenRegressor(random_state=42)),\n              ('RANSAC', RANSACRegressor(random_state=42)), ]\ncolors = {'OLS': 'turquoise', 'Theil-Sen': 'gold', 'RANSAC': 'lightgreen'}\nlw = 2\n\n# #############################################################################\n# Outliers only in the y direction\n\nnp.random.seed(0)\nn_samples = 200\n# Linear model y = 3*x + N(2, 0.1**2)\nx = np.random.randn(n_samples)\nw = 3.\nc = 2.\nnoise = 0.1 * np.random.randn(n_samples)\ny = w * x + c + noise\n# 10% outliers\ny[-20:] += -20 * x[-20:]\nX = x[:, np.newaxis]\n\nplt.scatter(x, y, color='indigo', marker='x', s=40)\nline_x = np.array([-3, 3])\nfor name, estimator in estimators:\n    t0 = time.time()\n    estimator.fit(X, y)\n    elapsed_time = time.time() - t0\n    y_pred = estimator.predict(line_x.reshape(2, 1))\n    plt.plot(line_x, y_pred, color=colors[name], linewidth=lw,\n             label='%s (fit time: %.2fs)' % (name, elapsed_time))\n\nplt.axis('tight')\nplt.legend(loc='upper left')\nplt.title(\"Corrupt y\")\n\n# #############################################################################\n# Outliers in the X direction\n\nnp.random.seed(0)\n# Linear model y = 3*x + N(2, 0.1**2)\nx = np.random.randn(n_samples)\nnoise = 0.1 * np.random.randn(n_samples)\ny = 3 * x + 2 + noise\n# 10% outliers\nx[-20:] = 9.9\ny[-20:] += 22\nX = x[:, np.newaxis]\n\nplt.figure()\nplt.scatter(x, y, color='indigo', marker='x', s=40)\n\nline_x = np.array([-3, 10])\nfor name, estimator in estimators:\n    t0 = time.time()\n    estimator.fit(X, y)\n    elapsed_time = time.time() - t0\n    y_pred = estimator.predict(line_x.reshape(2, 1))\n    plt.plot(line_x, y_pred, color=colors[name], linewidth=lw,\n             label='%s (fit time: %.2fs)' % (name, elapsed_time))\n\nplt.axis('tight')\nplt.legend(loc='upper left')\nplt.title(\"Corrupt x\")\nplt.show()"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.6.9"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}